13 research outputs found

Space-efficient detection of unusual words

Author: A Apostolico
A Apostolico
CAR Hoare
D Belazzougui
D Belazzougui
J Herold
J Lin
M Crochemore
S Chairungsee
Publication venue
Publication date: 01/01/2015
Field of study

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

Comment: arXiv admin note: text overlap with arXiv:1502.0637
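As a rough illustration of what "more or less frequent than expected under an IID model" can mean, the sketch below compares the observed count of a word with its expected count when letters are drawn independently with their empirical frequencies. It is a brute-force toy, not the BWT-based algorithm of the paper; the function names `observed_count` and `iid_expected_count` are hypothetical.

```python
from collections import Counter
from math import prod


def observed_count(text: str, word: str) -> int:
    """Count (possibly overlapping) occurrences of word in text by brute force."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))


def iid_expected_count(text: str, word: str) -> float:
    """Expected number of occurrences of word under an IID model.

    Assumption: letter probabilities are estimated as empirical frequencies
    in text, and every starting position is treated alike.
    """
    n, m = len(text), len(word)
    freq = Counter(text)
    # Probability that the word appears at one fixed position.
    p_word = prod(freq[c] / n for c in word)
    # Number of starting positions where the word could fit.
    positions = max(0, n - m + 1)
    return positions * p_word


if __name__ == "__main__":
    text, word = "aabb", "ab"
    print(observed_count(text, word), iid_expected_count(text, word))
```

A word whose observed count is far above or below this expectation would be a candidate "unusual" word; how to score the deviation is left open here.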

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Minimal Forbidden Factors of Circular Words

Author: AJ Pinho
C Barton
C Barton
D Belazzougui
F Mignosi
G Fici
M Béal
M Béal
M Crochemore
M Crochemore
M Crochemore
S Chairungsee
Publication venue
Publication date: 01/01/2017
Field of study

Minimal forbidden factors are a useful tool for investigating properties of words and languages. Two factorial languages are distinct if and only if they have different (antifactorial) sets of minimal forbidden factors. There exist algorithms for computing the minimal forbidden factors of a word, as well as of a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an algorithm that, given the trie recognizing a finite antifactorial language $M$, computes a DFA recognizing the language whose set of minimal forbidden factors is $M$. In the same paper, they showed that the obtained DFA is minimal if the input trie recognizes the minimal forbidden factors of a single word. We generalize this result to the case of a circular word. We discuss several combinatorial properties of the minimal forbidden factors of a circular word. As a byproduct, we obtain a formal definition of the factor automaton of a circular word. Finally, we investigate the case of minimal forbidden factors of the circular Fibonacci words.

Comment: To appear in Theoretical Computer Science

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Palermo

A framework for space-efficient string kernels

Author: A Apostolico
A Apostolico
AJ Smola
AM İleri
B Chor
D Belazzougui
G Reinert
GE Sims
J Herold
J Qi
J Shawe-Taylor
M Crochemore
R Chikhi
S Chairungsee
Publication venue
Publication date: 23/02/2015
Field of study

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data
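To make the k-mer kernel concrete, here is a minimal sketch under one common definition: the inner product of the two k-mer count vectors, optionally normalized to a cosine similarity. It uses plain hash tables rather than the space-efficient BWT machinery described in the abstract; the names `kmer_counts`, `kmer_kernel` and `kmer_cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s (overlapping occurrences included)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers present in both strings contribute to the sum.
    return sum(count * ct[kmer] for kmer, count in cs.items() if kmer in ct)


def kmer_cosine(s: str, t: str, k: int) -> float:
    """Kernel normalized to [0, 1]; returns 0.0 if either string has no k-mer."""
    denom = sqrt(kmer_kernel(s, s, k) * kmer_kernel(t, t, k))
    return kmer_kernel(s, t, k) / denom if denom else 0.0


if __name__ == "__main__":
    print(kmer_kernel("abab", "baba", 2), kmer_cosine("abab", "baba", 2))
```

This toy version needs space proportional to the number of distinct k-mers, which is exactly the cost the abstract's approach is designed to avoid.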

arXiv.org e-Print Archive

Crossref

Minimal Absent Words in Rooted and Unrooted Trees

Author: B Schieber
C Barton
D Belazzougui
D Belazzougui
F Mignosi
F Mignosi
F Mignosi
G Fici
G Fici
M Béal
M Béal
M Crochemore
M Crochemore
M Crochemore
M-P Béal
MA Bender
P Charalampopoulos
P Charalampopoulos
RM Silva
S Chairungsee
T Shibuya
Y Almirantis
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019
Field of study

We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Palermo

Efficient Computing of Longest Previous Reverse Factors

Author: Chairungsee S
Crochemore M
Publication venue
Publication date: 01/01/2009
Field of study

King's Research Portal

Linear-Time Sequence Comparison Using Minimal Absent Words & Applications

Author: A Goios
A Mosig
AJ Pinho
C Acquisti
C Barton
C Barton
D Belazzougui
D Robinson
E Ukkonen
F Mignosi
GM Landau
J Fischer
J Fischer
L Ilie
M Béal
M Crochemore
M Crochemore
M Domazet-Lošo
M Maes
N Saitou
R Grossi
RM Silva
S Chairungsee
SP Garcia
TJ Wheeler
U Manber
W Fletcher
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 22/12/2015
Field of study

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realized by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as q-gram distance, are usually computed in time linear with respect to the length of the sequences. In this article, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences
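The definition of a minimal absent word given above can be checked directly with a brute-force sketch. This is only an illustration of the definition (it enumerates all factors, so it is far from the linear-time algorithm announced in the abstract); the function name `minimal_absent_words` and the explicit `alphabet` parameter are assumptions made for the example.

```python
def minimal_absent_words(y: str, alphabet: str) -> set:
    """Return all minimal absent words of y over the given alphabet.

    A word w is reported when w does not occur in y but its longest proper
    prefix and longest proper suffix both do; every other proper factor of w
    is a factor of one of those two, so it occurs as well.
    """
    n = len(y)
    # All factors of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for prefix in factors:
        for letter in alphabet:
            w = prefix + letter
            # w[:-1] is a factor by construction; check w itself and w[1:].
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
    # prints ['aaa', 'aaba', 'bab', 'bb']
```

Two sequences could then be compared, for instance, through the symmetric difference of their sets of minimal absent words; the specific measure used in the paper is not reproduced here.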

arXiv.org e-Print Archive

Crossref

King's Research Portal

Archivio istituzionale della ricerca - Università di Palermo

Bacteria classification using minimal absent words

Author: Chairungsee S Crochemore M
Cole JR Wang Q, Cardenas E, et al.
Crochemore M Mignosi F, Restivo A
Fiannaca A La Rosa M, Rizzo R, et al.
Nelson KE Weinstock GM
Specht DF
Publication venue: 'American Institute of Mathematical Sciences (AIMS)'
Publication date: 01/01/2018
Field of study

Crossref

Faster Online Computation of the Succinct Longest Previous Factor Array

Author: A Heliou
AJ Cox
D Okanohara
EM McCreight
F Franěk
G Chen
GS Brodal
J Kärkkäinen
J Ziv
JI Munro
M Alzamel
M Crochemore
M Crochemore
MI Abouelhoda
P Bille
S Chairungsee
SJ Puglisi
T Kasai
U Manber
WK Hon
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each position i, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm computes directly the succinct LPF and LCS arrays
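The LPF definition used in this abstract (occurrences are matched by their ending positions) can be illustrated with a naive sketch. It runs in cubic time in the worst case and uses 0-based indices, so it is only a reference for the definition and not the compressed-space online algorithm of the paper; the function name `lpf_ending_at` is hypothetical.

```python
def lpf_ending_at(t: str) -> list:
    """For each index i, length of the longest factor of t that ends at i
    and also ends at some earlier index j < i (occurrences may overlap)."""
    n = len(t)
    lpf = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            # Extend the match backwards from positions i and j.
            length = 0
            while length <= j and t[j - length] == t[i - length]:
                length += 1
            best = max(best, length)
        lpf[i] = best
    return lpf


if __name__ == "__main__":
    print(lpf_ending_at("abaab"))  # prints [0, 0, 1, 1, 2]
```

Because each entry depends only on the prefix of the text read so far, the array can be extended one character at a time, which is the sense in which the problem is "online".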

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Efficient algorithms for three variants of the LPF table

Author: Bell
Bender
Böckenhauer
Chairungsee
Chen
Costas S. Iliopoulos
Crochemore
Crochemore
Crochemore
Crochemore
Crochemore
Fischer
Fischer
Franek
Hartman
Kasai
Kim
Ko
Kolpakov
Kärkkäinen
Main
Manacher
Manber
Marcin Kubica
Maxime Crochemore
Nong
Sadakane
Tomasz Waleń
Wojciech Rytter
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Minimal Absent Words in a Sliding Window and Applications to On-Line Pattern Matching

Author: AV Aho
B Dömölki
C Barton
C Barton
D Belazzougui
DE Knuth
E Ukkonen
F Mignosi
G Kucherov
G Myers
G Navarro
G Navarro
GM Landau
J Herold
M Crochemore
M Crochemore
M Crochemore
M-P Béal
MS Rahman
RM Silva
S Chairungsee
T Ota
Z Wu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/08/2017
Field of study

An absent (or forbidden) word of a word y is a word that does not occur in y. It is then called minimal if all its proper factors occur in y. There exist linear-time and linear-space algorithms for computing all minimal absent words of y (Crochemore et al. in Inf Process Lett 67:111–117, 1998; Belazzougui et al. in ESA 8125:133–144, 2013; Barton et al. in BMC Bioinform 15:388, 2014). Minimal absent words are used for data compression (Crochemore et al. in Proc IEEE 88:1756–1768, 2000; Ota and Morita in Theoret Comput Sci 526:108–119, 2014) and for alignment-free sequence comparison by utilizing a metric based on minimal absent words (Chairungsee and Crochemore in Theoret Comput Sci 450:109–116, 2012). They are also used in molecular biology; for instance, three minimal absent words of the human genome were found to play a functional role in a coding region in Ebola virus genomes (Silva et al. in Bioinformatics 31:2421–2425, 2015). In this article we introduce a new application of minimal absent words for on-line pattern matching. Specifically, we present an algorithm that, given a pattern x and a text y, computes the distance between x and every window of size |x| on y. The running time is O(σ|y|), where σ is the size of the alphabet. Along the way, we show an O(σ|y|)-time and O(σ|x|)-space algorithm to compute the minimal absent words of every window of size |x| on y, together with some new combinatorial insight on minimal absent words

HAL - Normandie Université

Crossref

King's Research Portal

Hal-Diderot

HAL-Polytechnique

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM